Data Analysis Final Project¶
Due Date: January 29, 2026, 23:59
Group: Cuircuit Synergy
Created By: Jeremia Baumgartner, Lorenz Buchinger, Tim Zwölfer
Table of contents
- A. Data Preprocessing and Data Quality (70 points)
- B. Visualization and Exploratory Analysis (55 points)
- C. Probability and Event Analysis (45 points)
- D. Statistical Theory Applications (45 points)
- E. Regression and Predictive Modeling (45 points)
- F. Dimensionality Reduction and Statistical Tests (40 points)
Initial Setup
# Initial setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Configure plotting
plt.rcParams.update({
'figure.figsize': [6, 4],
'figure.dpi': 150,
'figure.autolayout': True,
'axes.labelsize': 8,
'axes.titlesize': 10,
'font.size': 6,
'xtick.labelsize': 6,
'ytick.labelsize': 6
})
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)
np.random.seed(42)
Load Data
df_csv_data = pd.read_csv('traffic_accidents.csv')
df_csv_data_raw = df_csv_data.copy() # Keep a raw copy (a true copy, not an alias)
A. Data Preprocessing and Data Quality (70 points)¶
Assigned to Lorenz
- Dataset overview (dimensions, columns, types, time range, sampling rate, missingness summary) (10 points)
- Basic statistical analysis using pandas (descriptives, grouped stats, quantiles) (10 points)
- Original data quality analysis with visualization (missingness patterns, outliers, duplicates, timestamp gaps, inconsistent units) (20 points)
- Data preprocessing pipeline (cleaning steps, handling missing data, outliers strategy, resampling or alignment if needed, feature engineering basics) (20 points)
- Preprocessed vs original comparison (before/after visuals plus commentary on what changed and why) (10 points)
Dataset Overview¶
Dimensions
print("The dataset contains", df_csv_data.shape[0], "rows and", df_csv_data.shape[1], "columns.")
The dataset contains 209306 rows and 24 columns.
Columns and types
schema_table = (
df_csv_data.dtypes
.reset_index()
.rename(columns={"index": "column", 0: "dtype"})
)
schema_table
| | column | dtype |
|---|---|---|
| 0 | crash_date | object |
| 1 | traffic_control_device | object |
| 2 | weather_condition | object |
| 3 | lighting_condition | object |
| 4 | first_crash_type | object |
| 5 | trafficway_type | object |
| 6 | alignment | object |
| 7 | roadway_surface_cond | object |
| 8 | road_defect | object |
| 9 | crash_type | object |
| 10 | intersection_related_i | object |
| 11 | damage | object |
| 12 | prim_contributory_cause | object |
| 13 | num_units | int64 |
| 14 | most_severe_injury | object |
| 15 | injuries_total | float64 |
| 16 | injuries_fatal | float64 |
| 17 | injuries_incapacitating | float64 |
| 18 | injuries_non_incapacitating | float64 |
| 19 | injuries_reported_not_evident | float64 |
| 20 | injuries_no_indication | float64 |
| 21 | crash_hour | int64 |
| 22 | crash_day_of_week | int64 |
| 23 | crash_month | int64 |
Time Range
# Parse the timestamps first: min()/max() on the raw strings would compare
# lexicographically (month first), not chronologically
crash_ts = pd.to_datetime(df_csv_data["crash_date"], format='%m/%d/%Y %I:%M:%S %p')
first_date = crash_ts.min()
last_date = crash_ts.max()
print("First entry date:", first_date)
print("Last entry date:", last_date)
First entry date: 01/01/2016 01:03:00 AM
Last entry date: 12/31/2024 12:55:00 PM
Sampling Rate
There is no constant sampling rate, since car accidents do not occur at fixed intervals.
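This can be verified directly from the inter-arrival times between consecutive crashes; a minimal sketch (assuming the same crash_date format used elsewhere in this notebook):
# Gaps between consecutive crashes; a fixed sampling rate would produce a
# single constant gap, which is clearly not the case here
ts = pd.to_datetime(df_csv_data["crash_date"], format='%m/%d/%Y %I:%M:%S %p').sort_values()
print(ts.diff().dropna().describe())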
Missingness Summary
Missing values in the dataset are encoded using the string 'UNKNOWN' rather than null values.
num_values_total = df_csv_data_raw.shape[0] * df_csv_data_raw.shape[1]
print("Total number of values in the dataset:", num_values_total)
unknown_total = (df_csv_data_raw == "UNKNOWN").sum().sum()
print("Total UNKNOWN values:", unknown_total)
unknown_percentage = (unknown_total / num_values_total) * 100
print(f"Percentage of UNKNOWN values: {unknown_percentage:.2f}%")
Total number of values in the dataset: 5023344
Total UNKNOWN values: 63320
Percentage of UNKNOWN values: 1.26%
missing_summary = pd.DataFrame({
"missing_count": (df_csv_data_raw == "UNKNOWN").sum(),
"missing_percent": ((df_csv_data_raw == "UNKNOWN").mean() * 100).round(2)
}).sort_values(by="missing_percent", ascending=False)
missing_summary
| column | missing_count | missing_percent |
|---|---|---|
| road_defect | 34426 | 16.45 |
| roadway_surface_cond | 12509 | 5.98 |
| weather_condition | 6534 | 3.12 |
| traffic_control_device | 4455 | 2.13 |
| lighting_condition | 4336 | 2.07 |
| trafficway_type | 1060 | 0.51 |
| crash_date | 0 | 0.00 |
| injuries_total | 0 | 0.00 |
| crash_day_of_week | 0 | 0.00 |
| crash_hour | 0 | 0.00 |
| injuries_no_indication | 0 | 0.00 |
| injuries_reported_not_evident | 0 | 0.00 |
| injuries_non_incapacitating | 0 | 0.00 |
| injuries_incapacitating | 0 | 0.00 |
| injuries_fatal | 0 | 0.00 |
| prim_contributory_cause | 0 | 0.00 |
| most_severe_injury | 0 | 0.00 |
| num_units | 0 | 0.00 |
| damage | 0 | 0.00 |
| intersection_related_i | 0 | 0.00 |
| crash_type | 0 | 0.00 |
| alignment | 0 | 0.00 |
| first_crash_type | 0 | 0.00 |
| crash_month | 0 | 0.00 |
Basic Statistical Analysis Using Pandas (descriptives, grouped stats, quantiles)¶
Descriptive Analysis
Quantiles, minimum, maximum, mean, and standard deviation of all numeric columns.
df_csv_data = df_csv_data.replace("UNKNOWN", np.nan)
df_csv_data.describe()
| | num_units | injuries_total | injuries_fatal | injuries_incapacitating | injuries_non_incapacitating | injuries_reported_not_evident | injuries_no_indication | crash_hour | crash_day_of_week | crash_month |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 |
| mean | 2.063300 | 0.382717 | 0.001859 | 0.038102 | 0.221241 | 0.121516 | 2.244002 | 13.373047 | 4.144024 | 6.771822 |
| std | 0.396012 | 0.799720 | 0.047502 | 0.233964 | 0.614960 | 0.450865 | 1.241175 | 5.603830 | 1.966864 | 3.427593 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 9.000000 | 2.000000 | 4.000000 |
| 50% | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 14.000000 | 4.000000 | 7.000000 |
| 75% | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 17.000000 | 6.000000 | 10.000000 |
| max | 11.000000 | 21.000000 | 3.000000 | 7.000000 | 21.000000 | 15.000000 | 49.000000 | 23.000000 | 7.000000 | 12.000000 |
# Quantiles for numeric columns only to avoid string columns
numeric_cols = df_csv_data.select_dtypes(include="number")
if numeric_cols.empty:
raise ValueError("No numeric columns found.")
quantiles = numeric_cols.quantile([0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99])
quantiles
Grouped Stats
# Count number of accidents per weekday
weekday_counts = df_csv_data['crash_day_of_week'].value_counts().sort_index()
# Plot
# Map numeric days to short names (assumes 1=Sunday, 2=Monday, ..., 7=Saturday)
day_map = {1: "Sun", 2: "Mon", 3: "Tue", 4: "Wed", 5: "Thu", 6: "Fri", 7: "Sat"}
wc = weekday_counts.reset_index()
wc.columns = ["day_num", "count"]
wc["day_name"] = wc["day_num"].map(day_map)
# Order Monday -> Sunday
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
wc["day_name"] = pd.Categorical(wc["day_name"], categories=order, ordered=True)
wc = wc.sort_values("day_name")
plt.figure(figsize=(8, 4))
ax = sns.barplot(data=wc, x="day_name", y="count", hue="day_name", palette="viridis", legend=False)
ax.set_xlabel("Day of Week")
ax.set_ylabel("Number of Accidents")
ax.set_title("Accidents per Weekday", fontsize=12)
ax.grid(axis="y", linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()
# Nicely styled barplot for total injuries per weekday
weekday_injuries = df_csv_data.groupby("crash_day_of_week")["injuries_total"].sum().sort_index()
day_map = {1: "Sun", 2: "Mon", 3: "Tue", 4: "Wed", 5: "Thu", 6: "Fri", 7: "Sat"}
wc = weekday_injuries.reset_index()
wc.columns = ["day_num", "injuries"]
wc["day_name"] = wc["day_num"].map(day_map)
# Order Monday -> Sunday
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
wc["day_name"] = pd.Categorical(wc["day_name"], categories=order, ordered=True)
wc = wc.sort_values("day_name")
plt.figure(figsize=(8, 4))
ax = sns.barplot(data=wc, x="day_name", y="injuries", hue="day_name", palette="viridis", legend=False)
ax.set_xlabel("Day of Week")
ax.set_ylabel("Total Number of Injuries")
ax.set_title("Total Number of Injuries per Weekday", fontsize=12)
ax.grid(axis="y", linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()
# Nicely styled weather condition distribution (top categories + "Other")
vc = df_csv_data["weather_condition"].fillna("UNKNOWN").value_counts()
top_n = 10
if len(vc) > top_n:
top = vc.head(top_n).copy()
top["Other"] = vc.iloc[top_n:].sum()
else:
top = vc.copy()
df_plot = top.reset_index()
df_plot.columns = ["weather", "count"]
df_plot["pct"] = df_plot["count"] / df_plot["count"].sum() * 100
plt.figure(figsize=(10, 5))
ax = sns.barplot(data=df_plot, x="weather", y="count", hue="weather", palette="mako", legend=False)
ax.set_title("Weather Condition Distribution", fontsize=12)
ax.set_xlabel("Weather Condition")
ax.set_ylabel("Number of Accidents")
ax.tick_params(axis="x", rotation=45, labelsize=8)
ax.tick_params(axis="y", labelsize=8)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, pos: f"{int(x):,}"))
# Annotate bars with count and percentage
max_count = df_plot["count"].max()
for p, cnt, pct in zip(ax.patches, df_plot["count"], df_plot["pct"]):
ax.text(p.get_x() + p.get_width() / 2, p.get_height() + max_count * 0.01,
f"{int(cnt):,}\n({pct:.1f}%)", ha="center", va="bottom", fontsize=8)
plt.grid(axis="y", linestyle="--", alpha=0.6)
sns.despine(trim=True)
plt.tight_layout()
plt.show()
# Analysis of correlation between injuries and weather condition
df_csv_data.groupby("weather_condition")["injuries_total"].agg(
["count", "mean", "median", "std"]
)
| weather_condition | count | mean | median | std |
|---|---|---|---|---|
| BLOWING SAND, SOIL, DIRT | 1 | 0.000000 | 0.0 | NaN |
| BLOWING SNOW | 127 | 0.314961 | 0.0 | 0.752781 |
| CLEAR | 164700 | 0.390401 | 0.0 | 0.812737 |
| CLOUDY/OVERCAST | 7533 | 0.369308 | 0.0 | 0.745190 |
| FOG/SMOKE/HAZE | 360 | 0.444444 | 0.0 | 0.829285 |
| FREEZING RAIN/DRIZZLE | 510 | 0.460784 | 0.0 | 0.864851 |
| OTHER | 627 | 0.483254 | 0.0 | 0.821607 |
| RAIN | 21703 | 0.411510 | 0.0 | 0.808552 |
| SEVERE CROSS WIND GATE | 32 | 0.250000 | 0.0 | 0.762001 |
| SLEET/HAIL | 308 | 0.461039 | 0.0 | 0.851388 |
| SNOW | 6871 | 0.308543 | 0.0 | 0.698708 |
# Count crashes per hour and plot
hour_counts = df_csv_data["crash_hour"].dropna().astype(int).value_counts().sort_index()
# Ensure hours 0-23 are present
hours = pd.Series(0, index=range(24))
hour_counts = hours.add(hour_counts, fill_value=0).astype(int)
df_hour = hour_counts.reset_index()
df_hour.columns = ["hour", "count"]
plt.figure(figsize=(12, 4))
ax = sns.barplot(data=df_hour, x="hour", y="count", hue="hour", palette="viridis", legend=False)
ax.set_xlabel("Crash Hour (0-23)")
ax.set_ylabel("Number of Crashes")
ax.set_title("Count of Crashes by Hour of Day")
ax.set_xticks(range(24))
ax.set_xticklabels([str(h) for h in range(24)])
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, pos: f"{int(x):,}"))
# Annotate bars
max_count = df_hour["count"].max()
for p, cnt in zip(ax.patches, df_hour["count"]):
ax.text(p.get_x() + p.get_width() / 2, p.get_height() + max_count * 0.01,
f"{int(cnt):,}", ha="center", va="bottom", fontsize=8)
plt.tight_layout()
plt.show()
Original data quality analysis with visualization¶
Missingness Patterns
missing_counts = df_csv_data.isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]
road_defect               34426
roadway_surface_cond      12509
weather_condition          6534
traffic_control_device     4455
lighting_condition         4336
trafficway_type            1060
dtype: int64
cols = [
"road_defect",
"roadway_surface_cond",
"weather_condition",
"traffic_control_device",
"lighting_condition",
"trafficway_type"
]
# Ensure datetime
df = df_csv_data.copy()
df["crash_date"] = pd.to_datetime(df["crash_date"], format='%m/%d/%Y %I:%M:%S %p')
# Keep only relevant columns
df = df[["crash_date"] + cols]
# Create monthly period
df["month"] = df["crash_date"].dt.to_period("M")
# Calculate fraction missing per month per column
monthly_missing = (
df.groupby("month")[cols]
.apply(lambda x: x.isna().mean())
)
# Convert PeriodIndex to timestamp for plotting
monthly_missing.index = monthly_missing.index.to_timestamp()
# Plot heatmap
plt.figure(figsize=(16, 6))
ax = sns.heatmap(
monthly_missing.T,
cmap="rocket_r",
cbar_kws={"label": "Fraction Missing"},
linewidths=0.3
)
# Year-only x-axis ticks
months = monthly_missing.index
year_ticks = [i for i, d in enumerate(months) if d.month == 1]
year_labels = [d.year for d in months if d.month == 1]
ax.set_xticks(year_ticks)
ax.set_xticklabels(year_labels, rotation=0)
plt.title("Monthly Missing Data Fraction by Column")
plt.xlabel("Year")
plt.ylabel("Column")
plt.tight_layout()
plt.show()
# Prepare data
col_names = [
"injuries_non_incapacitating",
"injuries_incapacitating",
"injuries_reported_not_evident",
"injuries_no_indication",
"injuries_fatal",
"injuries_total"
]
selected = df_csv_data[col_names]
# Zero fraction (ignoring NaNs)
non_na_counts = selected.notna().sum()
zero_frac = (selected == 0).sum() / non_na_counts
# Melt non-zero values and add log1p transform
df_melted = selected.melt(var_name="Variable", value_name="Value")
df_nonzero = df_melted[df_melted["Value"] > 0].copy()
df_nonzero["log1p"] = np.log1p(df_nonzero["Value"])
pretty_labels = [
"Non-incapacitating",
"Incapacitating",
"Reported (not evident)",
"No indication",
"Fatal",
"Total"
]
# Plot 1: Zero fraction
plt.figure(figsize=(10, 5))
ax = sns.barplot(x=zero_frac.index, y=zero_frac.values * 100, hue=zero_frac.index, palette="muted", legend=False)
ax.set_ylabel("Percentage of zeros (%)")
ax.set_xlabel("")
ax.set_title("Zero fraction per variable")
ax.set_xticks(range(len(pretty_labels)))
ax.set_xticklabels(pretty_labels, rotation=45, ha="right")
plt.tight_layout()
plt.show()
# Plot 2: Boxplot on log1p of non-zero values
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df_nonzero, x="Variable", y="log1p",
                 showcaps=True, showfliers=True, hue="Variable",
                 palette="vlag", order=col_names, legend=False)
ax.set_ylabel("log1p(Value)")
ax.set_title("Boxplot of non-zero values (log1p)")
ax.set_xticks(range(len(pretty_labels)))
ax.set_xticklabels(pretty_labels, rotation=45, ha="right")
plt.tight_layout()
plt.show()
outlier_counts = {}
for col in col_names:
Q1 = df_csv_data[col].quantile(0.25)
Q3 = df_csv_data[col].quantile(0.75)
IQR = Q3 - Q1
outliers = (
(df_csv_data[col] < Q1 - 1.5 * IQR) |
(df_csv_data[col] > Q3 + 1.5 * IQR)
)
outlier_counts[col] = outliers.sum()
outlier_df = (
pd.DataFrame.from_dict(
outlier_counts,
orient="index",
columns=["Outlier Count"]
)
.sort_values("Outlier Count", ascending=False)
)
plt.figure(figsize=(12,5))
sns.barplot(
data=outlier_df.reset_index(),
x="index",
y="Outlier Count"
)
plt.xticks(rotation=45, ha="right")
plt.title("Number of Outliers per Injury-Related Variables (IQR Method)")
plt.tight_layout()
plt.show()
# Check for typos, different capitalizations, or inconsistent entries in categorical columns
for col in df_csv_data.select_dtypes(include="object"):
print(col)
print(df_csv_data[col].value_counts().head())
print("-"*40)
# No typos or inconsistencies found in categorical columns
crash_date
12/29/2020 05:00:00 PM    10
02/17/2022 03:30:00 PM     8
09/11/2019 04:30:00 PM     6
01/12/2019 02:30:00 PM     6
11/26/2018 08:30:00 AM     6
Name: count, dtype: int64
----------------------------------------
traffic_control_device
TRAFFIC SIGNAL       123944
STOP SIGN/FLASHER     49139
NO CONTROLS           29508
OTHER                   670
YIELD                   468
Name: count, dtype: int64
----------------------------------------
weather_condition
CLEAR              164700
RAIN                21703
CLOUDY/OVERCAST      7533
SNOW                 6871
OTHER                 627
Name: count, dtype: int64
----------------------------------------
lighting_condition
DAYLIGHT                  134109
DARKNESS, LIGHTED ROAD     53378
DARKNESS                    7436
DUSK                        6323
DAWN                        3724
Name: count, dtype: int64
----------------------------------------
first_crash_type
TURNING                     64157
ANGLE                       52250
REAR END                    42018
SIDESWIPE SAME DIRECTION    20116
PEDESTRIAN                   8996
Name: count, dtype: int64
----------------------------------------
trafficway_type
NOT DIVIDED                        77753
FOUR WAY                           49057
DIVIDED - W/MEDIAN (NOT RAISED)    34221
ONE-WAY                            12341
DIVIDED - W/MEDIAN BARRIER         10720
Name: count, dtype: int64
----------------------------------------
alignment
STRAIGHT AND LEVEL       204590
STRAIGHT ON GRADE          2992
CURVE, LEVEL               1014
STRAIGHT ON HILLCREST       478
CURVE ON GRADE              179
Name: count, dtype: int64
----------------------------------------
roadway_surface_cond
DRY              155905
WET               32908
SNOW OR SLUSH      6203
ICE                1303
OTHER               438
Name: count, dtype: int64
----------------------------------------
road_defect
NO DEFECTS         171730
WORN SURFACE         1000
OTHER                 912
RUT, HOLES            741
SHOULDER DEFECT       358
Name: count, dtype: int64
----------------------------------------
crash_type
NO INJURY / DRIVE AWAY              117376
INJURY AND / OR TOW DUE TO CRASH     91930
Name: count, dtype: int64
----------------------------------------
intersection_related_i
Y    199324
N      9982
Name: count, dtype: int64
----------------------------------------
damage
OVER $1,500      147313
$501 - $1,500     41210
$500 OR LESS      20783
Name: count, dtype: int64
----------------------------------------
prim_contributory_cause
UNABLE TO DETERMINE              58316
FAILING TO YIELD RIGHT-OF-WAY    42914
FOLLOWING TOO CLOSELY            19084
DISREGARDING TRAFFIC SIGNALS     14591
IMPROPER TURNING/NO SIGNAL       12643
Name: count, dtype: int64
----------------------------------------
most_severe_injury
NO INDICATION OF INJURY     154789
NONINCAPACITATING INJURY     31527
REPORTED, NOT EVIDENT        16075
INCAPACITATING INJURY         6564
FATAL                          351
Name: count, dtype: int64
----------------------------------------
Analysis of Timestamp Gaps
# Ensure datetime and sort
df = df_csv_data.copy()
df["crash_date"] = pd.to_datetime(df["crash_date"])
df = df.sort_values("crash_date")
# Calculate gaps between consecutive accidents
df["time_gap"] = df["crash_date"].diff()
# Drop the first row (NaT gap)
gaps = df.dropna(subset=["time_gap"])
# Filter gaps greater than 1 day
long_gaps = (
gaps[gaps["time_gap"] > pd.Timedelta(days=1)]
[["crash_date", "time_gap"]]
.rename(columns={"crash_date": "gap_end"})
)
# Add gap start for clarity
long_gaps["gap_start"] = long_gaps["gap_end"] - long_gaps["time_gap"]
# Optional: sort largest gaps first
long_gaps = long_gaps.sort_values("time_gap", ascending=False)
long_gaps
| | gap_end | time_gap | gap_start |
|---|---|---|---|
| 121162 | 2015-02-13 08:00:00 | 621 days 11:31:00 | 2013-06-01 20:29:00 |
| 132318 | 2015-05-25 23:38:00 | 101 days 15:38:00 | 2015-02-13 08:00:00 |
| 23119 | 2013-06-01 20:29:00 | 90 days 03:41:00 | 2013-03-03 16:48:00 |
| 106211 | 2015-08-02 19:55:00 | 68 days 20:17:00 | 2015-05-25 23:38:00 |
| 46652 | 2015-08-14 09:30:00 | 2 days 19:15:00 | 2015-08-11 14:15:00 |
| 42585 | 2015-08-06 10:00:00 | 1 days 19:30:00 | 2015-08-04 14:30:00 |
| 1958 | 2015-08-10 09:15:00 | 1 days 12:00:00 | 2015-08-08 21:15:00 |
| 15595 | 2015-08-17 01:11:00 | 1 days 10:11:00 | 2015-08-15 15:00:00 |
| 96371 | 2015-08-11 14:15:00 | 1 days 05:00:00 | 2015-08-10 09:15:00 |
# Ensure datetime
df["crash_date"] = pd.to_datetime(df["crash_date"])
# Weekly record counts
weekly_counts = (
df.set_index("crash_date")
.resample("W")
.size()
)
# Compute overall gap span
gap_span_start = long_gaps["gap_start"].min()
gap_span_end = long_gaps["gap_end"].max()
# Plot full timeline
plt.figure(figsize=(14, 4))
plt.plot(weekly_counts.index, weekly_counts.values, label="Number of weekly records")
plt.axhline(0, linestyle="--", linewidth=1)
# Highlight overall gap timespan
plt.axvspan(
gap_span_start,
gap_span_end,
alpha=0.25,
label="Timespan with gaps >1 day"
)
plt.title("Record Count per Week with Record Gaps Highlighted")
plt.xlabel("Year")
plt.ylabel("Number of Records")
plt.legend()
plt.tight_layout()
plt.show()
# Ensure datetime
df_plot = df_csv_data.copy()
df_plot["crash_date"] = pd.to_datetime(df_plot["crash_date"])
# Loop through injury columns
for col in col_names:
if col not in df_plot.columns:
print(f"Skipping {col}: column not found.")
continue
# Resample weekly and sum values
weekly_sum_col = (
df_plot.set_index("crash_date")[col]
.resample("W-MON")
.sum(min_count=1) # min_count=1 ensures empty weeks show NaN instead of 0
)
# Plot
plt.figure(figsize=(14, 4))
plt.plot(weekly_sum_col.index, weekly_sum_col.values, label=f"Weekly sum ({col})")
plt.axhline(0, linestyle="--", linewidth=1)
plt.title(f"Weekly Sum of {col}")
plt.xlabel("Year")
plt.ylabel("Sum of Injuries")
plt.tight_layout()
plt.show()
Cleaning Data
We decided to remove all entries before November 2017: far more data is recorded after that point, and the sampling frequency, which is inconsistent earlier, becomes fairly consistent from November 2017 onward.
# Make a copy and ensure datetime
df_csv_data_cleaned = df_csv_data.copy()
df_csv_data_cleaned["crash_date"] = pd.to_datetime(df_csv_data_cleaned['crash_date'], format='%m/%d/%Y %I:%M:%S %p')
# Remove entries before November 2017
df_csv_data_cleaned = df_csv_data_cleaned[df_csv_data_cleaned["crash_date"] >= "2017-11-01"]
Visualizing removed data.
# Ensure datetime
df_plot = df_csv_data.copy()
df_plot["crash_date"] = pd.to_datetime(df_plot["crash_date"], format='%m/%d/%Y %I:%M:%S %p')
df_cleaned = df_csv_data_cleaned.copy()
df_cleaned["crash_date"] = pd.to_datetime(df_cleaned["crash_date"], format='%m/%d/%Y %I:%M:%S %p')
# Split removed data
df_removed = df_plot[df_plot["crash_date"] < "2017-11-01"]
# Concatenate for consistent weekly resampling
df_combined = pd.concat([df_removed, df_cleaned])
# Loop through injury columns
for col in col_names:
if col not in df_plot.columns:
print(f"Skipping {col}: column not found.")
continue
# Resample weekly over the full timeline
weekly_sum = df_combined.set_index("crash_date")[col].resample("W-MON").sum()
# Separate removed vs cleaned for plotting
weekly_removed = weekly_sum.loc[weekly_sum.index < "2017-11-01"]
weekly_cleaned = weekly_sum.loc[weekly_sum.index >= "2017-11-01"]
# Plot
plt.figure(figsize=(14, 4))
plt.plot(weekly_removed.index, weekly_removed.values, label="Removed data")
plt.plot(weekly_cleaned.index, weekly_cleaned.values, label="Cleaned data")
plt.axhline(0, linestyle="--", linewidth=1)
plt.title(f"Weekly Sum of {col}")
plt.xlabel("Year")
plt.ylabel("Sum of Injuries")
plt.legend()
plt.tight_layout()
plt.show()
B. Visualization and Exploratory Analysis (55 points)¶
Assigned to Jeremia
- Time-series visualizations (raw, smoothed, rolling mean or windowed views) (10 points)
- Distribution analysis with histograms and density style plots where applicable (10 points)
- Correlation analysis and heatmaps (Pearson and at least one alternative such as Spearman, with short interpretation) (10 points)
- Daily or periodic pattern analysis (day-of-week, hour-of-day, seasonality indicators, or test-cycle patterns) (15 points)
- Summary of observed patterns as short check statements (similar to True/False style) with evidence (10 points)
df_b = df_csv_data_cleaned.copy() # copy, so set_index below does not mutate the cleaned frame
# Display a Time Series Plot of weekly accidents
df_b.set_index('crash_date', inplace=True)
Time-series visualization¶
weekly_accidents = df_b.resample('W').size()
plt.figure(figsize=(10, 6))
plt.plot(weekly_accidents.index, weekly_accidents.values, marker='')
plt.title('Weekly Accidents Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Accidents')
plt.grid(True)
plt.show()
Distribution analysis: histograms and density plots¶
We focus on smoothed distributions for key numeric fields and normalized comparisons for major categorical factors using df_b.
# Crash hour distribution by weather (independent density per weather, unshared y-axis for visibility)
hour_weather = df_b.dropna(subset=["crash_hour", "weather_condition"])
g = sns.displot(
hour_weather,
x="crash_hour",
col="weather_condition",
col_wrap=3,
bins=24,
stat="density",
kde=True,
color="teal",
height=3,
aspect=1.2,
common_norm=False, # normalize densities within each weather facet
facet_kws={"sharey": False} # allow y-axis to scale per facet so curves are visible
)
# Set x-axis grid/ticks every 3 hours and update titles for all facets
for ax in g.axes.flatten():
ax.set_xticks(np.arange(0, 24, 3))
ax.grid(True, axis="x")
# Update title to show only the weather condition value
title = ax.get_title()
if "=" in title:
ax.set_title(title.split("=")[1].strip())
plt.suptitle("Crash hour distribution by weather condition", y=1.02)
plt.show()
# Top contributory causes (proportions) to highlight dominant categories
cause_counts = df_b["prim_contributory_cause"].value_counts(normalize=True).head(10)
sns.barplot(
x=cause_counts.values,
y=cause_counts.index,
orient="h",
color="slateblue"
)
plt.title("Top 10 primary contributory causes (share of crashes)")
plt.xlabel("Proportion of crashes")
plt.ylabel("Primary cause")
plt.xlim(0, cause_counts.values.max() * 1.1)
plt.show()
Histogram of total injuries by crash hour¶
Aggregated injuries per hour of day using df_b (the sums capture total injuries, not just crash counts).
# Stacked barplot of injury types by crash hour
injury_cols = [
"injuries_fatal",
"injuries_incapacitating",
"injuries_non_incapacitating",
"injuries_reported_not_evident",
]
hour_inj = df_b.dropna(subset=["crash_hour"] + injury_cols)
hour_inj_sum = (
hour_inj.groupby("crash_hour")[injury_cols]
.sum()
.reindex(range(24), fill_value=0)
)
bottom = np.zeros(24)
hours = hour_inj_sum.index
colors = ["#d73027", "#fc8d59", "#fee08b", "#91bfdb"]
plt.figure()
for col, color in zip(injury_cols, colors):
plt.bar(
hours,
hour_inj_sum[col],
bottom=bottom,
label=col.replace("_", " "),
color=color,
edgecolor="white"
)
bottom += hour_inj_sum[col].values
plt.title("Injuries by crash hour (stacked)")
plt.xlabel("Crash hour")
plt.ylabel("Count of injuries")
plt.xticks(range(0, 24, 2))
plt.grid(True, axis="x")
plt.legend(title="Injury type", fontsize=6, title_fontsize=7)
plt.show()
# Normalize injury counts per hour to percentages and plot stacked bars (100% total per hour)
hour_inj_pct = hour_inj_sum.div(hour_inj_sum.sum(axis=1), axis=0).fillna(0) * 100
bottom_pct = np.zeros(len(hours))
plt.figure()
for col, color in zip(injury_cols, colors):
plt.bar(
hour_inj_pct.index,
hour_inj_pct[col],
bottom=bottom_pct,
label=col.replace("_", " "),
color=color,
edgecolor="white"
)
bottom_pct += hour_inj_pct[col].values
plt.title("Injuries by crash hour (stacked, 100% scale)")
plt.xlabel("Crash hour")
plt.ylabel("Share of injuries (%)")
plt.ylim(0, 100)
plt.xticks(range(0, 24, 2))
plt.grid(True, axis="x")
plt.legend(title="Injury type", fontsize=6, title_fontsize=7)
plt.show()
Correlation analysis (Pearson & Spearman)¶
We focus on numeric crash/injury metrics where correlation is meaningful; both Pearson (linear) and Spearman (rank/monotonic) are shown.
# Pearson correlation heatmap (linear relationships)
num_cols = [
"injuries_fatal",
"injuries_incapacitating",
"injuries_non_incapacitating",
"injuries_reported_not_evident",
"num_units",
"crash_hour",
"crash_day_of_week",
"crash_month",
]
corr_df = df_b.reset_index()[num_cols].dropna()
corr_p = corr_df.corr(method="pearson")
plt.figure(figsize=(6, 5))
sns.heatmap(
corr_p,
annot=True,
fmt=".2f",
cmap="coolwarm",
vmin=-1,
vmax=1,
square=True,
cbar=False,
annot_kws={"size": 4}
)
plt.title("Pearson correlation (linear)")
plt.tight_layout()
plt.show()
# Spearman correlation heatmap (rank/monotonic relationships)
corr_s = corr_df.corr(method="spearman")
plt.figure(figsize=(6, 5))
sns.heatmap(
corr_s,
annot=True,
fmt=".2f",
cmap="coolwarm",
vmin=-1,
vmax=1,
square=True,
cbar=False,
annot_kws={"size": 4}
)
plt.title("Spearman correlation (rank)")
plt.tight_layout()
plt.show()
Seasonality by week of year¶
Mean crashes per ISO week (across years) with variability to highlight weekly seasonality.
# Weekly seasonality across years (mean crashes per ISO week with variability)
# Ensure datetime index
if not np.issubdtype(df_b.index.dtype, np.datetime64):
df_b = df_b.copy()
df_b.index = pd.to_datetime(df_b.index)
weekly_counts = df_b.resample("W").size()
iso_week = weekly_counts.index.isocalendar().week
weekly_stats = weekly_counts.groupby(iso_week).agg(["mean", "std"])
weeks = sorted(weekly_stats.index)
mean_vals_w = weekly_stats.loc[weeks, "mean"]
std_vals_w = weekly_stats.loc[weeks, "std"]
plt.figure(figsize=(10, 4))
plt.bar(weeks, mean_vals_w, yerr=std_vals_w, color="seagreen", edgecolor="white", alpha=0.9, capsize=2)
plt.xticks(range(0, 53, 4))
plt.title("Average crashes per ISO week (multi-year)")
plt.xlabel("ISO week number")
plt.ylabel("Mean crashes per week")
plt.tight_layout()
plt.show()
Seasonality by month¶
Crashes aggregated across multiple years to reveal month-of-year patterns (mean monthly crashes with variability).
# Monthly seasonality across years (mean crashes per month with variability)
import calendar
# Ensure datetime index
if not np.issubdtype(df_b.index.dtype, np.datetime64):
df_b = df_b.copy()
df_b.index = pd.to_datetime(df_b.index)
monthly_counts = df_b.resample("ME").size()
monthly_stats = monthly_counts.groupby(monthly_counts.index.month).agg(["mean", "std"])
months = range(1, 13)
mean_vals = monthly_stats["mean"].reindex(months)
std_vals = monthly_stats["std"].reindex(months)
plt.figure(figsize=(8, 4))
plt.bar(months, mean_vals, yerr=std_vals, color="steelblue", edgecolor="white", alpha=0.9, capsize=3)
plt.xticks(months, [calendar.month_abbr[m] for m in months], rotation=0)
plt.title("Average crashes per month (multi-year)")
plt.xlabel("Month")
plt.ylabel("Mean crashes per month")
plt.tight_layout()
plt.show()
Summary¶
- Crash hour distribution differs by weather condition.
- The most frequent identified cause is "failing to yield right-of-way".
- Rush hours show crash peaks, at 7-8 h and 15-17 h.
- Fatal and incapacitating injuries make up a relatively larger share of injuries at night.
- The numeric columns show no strong correlations with each other.
- During the Christmas holidays the number of accidents drops (checked programmatically below).
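A minimal sketch of how two of these statements can be verified as True/False checks (the specific hour and week thresholds are illustrative assumptions, not part of the original analysis):
# Check 1: rush-hour peaks -- are hours 7-8 and 15-17 above the hourly average?
hourly = df_b.groupby("crash_hour").size()
print("Rush hours above average:", bool((hourly.loc[[7, 8, 15, 16, 17]] > hourly.mean()).all()))
# Check 2: Christmas dip -- are the last ISO weeks of the year below the weekly average?
weekly = df_b.resample("W").size()
xmas_weeks = weekly[weekly.index.isocalendar().week >= 52]
print("Christmas weeks below average:", bool(xmas_weeks.mean() < weekly.mean()))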
C. Probability and Event Analysis (45 points)¶
Assigned to Tim
- Threshold-based probability estimation for events (define event, justify threshold, compute empirical probability) (15 points)
- Cross tabulation analysis for two variables (10 points)
- Conditional probability analysis (at least two meaningful conditional relationships) (15 points)
- Summary of observations and limitations (what could bias these estimates, what assumptions were made) (5 points)
Threshold-based probability estimation¶
# Important note: these probabilities are purely data-driven and depend on
# the sample size, observation period, and data quality.
# Event 1: Late-hour crash (crash_hour >= 20 or crash_hour <= 5)
# Relevance: poor visibility, fatigue, alcohol, ...
# Threshold: typical night hours.
late_hour_event = (
(df_csv_data_cleaned["crash_hour"] >= 20) |
(df_csv_data_cleaned["crash_hour"] <= 5)
)
num_late_hour_crashes = late_hour_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_late_hour = num_late_hour_crashes / total_crashes
print(f"Empirical Probability Event 1 (Late Hour Crash): {empirical_probability_late_hour:.3%}")
# Event 2: Rush-hour crash (7:00-9:00 and 16:00-18:00)
# Relevance: High traffic density, frequent stop-and-go driving, time pressure and stress
# Thresholds: standard commuting periods
rush_hour_event = (
((df_csv_data_cleaned["crash_hour"] >= 7) & (df_csv_data_cleaned["crash_hour"] <= 9)) |
((df_csv_data_cleaned["crash_hour"] >= 16) & (df_csv_data_cleaned["crash_hour"] <= 18))
)
num_rush_hour_crashes = rush_hour_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_rush_hour = num_rush_hour_crashes / total_crashes
print(f"Empirical Probability Event 2 (Rush Hour Crash): {empirical_probability_rush_hour:.3%}")
# Event 3: Severe crash
# Relevance: fatalities or life-altering injuries requiring intensive medical care
# Threshold: at least one fatal or incapacitating injury; focuses on severity
# rather than frequency and avoids dilution by minor or non-injury crashes
severe_crash_event = (
(df_csv_data_cleaned["injuries_fatal"] >= 1) |
(df_csv_data_cleaned["injuries_incapacitating"] >= 1)
)
num_severe_crashes = severe_crash_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_severe = num_severe_crashes / total_crashes
print(
f"Empirical Probability Event 3 (Severe Crash): "
f"{empirical_probability_severe:.3%}"
)
# Event 4: Non-severe crash
# Relevance: most accidents are non-severe crashes with no major injuries
non_severe_crash_event = (
(df_csv_data_cleaned["injuries_fatal"] == 0) &
(df_csv_data_cleaned["injuries_incapacitating"] == 0)
)
num_non_severe_crashes = non_severe_crash_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_non_severe = num_non_severe_crashes / total_crashes
print(
f"Empirical Probability Event 4 (Non-Severe Crash): "
f"{empirical_probability_non_severe:.3%}"
)
# Since this is the complement of the severe crash event, the two probabilities should sum to 1
print(
f"Empirical Probability Non-Severe Crash + Severe Crash: "
f"{empirical_probability_non_severe + empirical_probability_severe:.3%}"
)
# Event 5: Rainy weather crash
# Relevance: rain and freezing rain reduce visibility, decrease tire-road
# friction, and increase braking distances
# Threshold: freezing rain is grouped with rain, since both involve liquid
# precipitation affecting surface friction
rainy_weather_event = df_csv_data_cleaned["weather_condition"].isin([
"RAIN",
"FREEZING RAIN/DRIZZLE"
])
num_rainy_weather_crashes = rainy_weather_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_rainy_weather = num_rainy_weather_crashes / total_crashes
print(
f"Empirical Probability Event 5 (Rainy Weather Crash): "
f"{empirical_probability_rainy_weather:.3%}"
)
# Event 6: Snowy weather crash
# Relevance: severely reduced friction impairs vehicle control; snow may also
# obscure lane markings and other vehicles
# Threshold: snow-related categories are grouped due to similar driving hazards
snowy_weather_event = df_csv_data_cleaned["weather_condition"].isin([
"SNOW",
"BLOWING SNOW"
])
num_snowy_weather_crashes = snowy_weather_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_snowy_weather = num_snowy_weather_crashes / total_crashes
print(
f"Empirical Probability Event 6 (Snowy Weather Crash): "
f"{empirical_probability_snowy_weather:.3%}"
)
# Event 7: Poor-visibility weather crash
# Relevance: these conditions reduce a driver's ability to detect hazards,
# judge distance and speed, and react in time
# Threshold: only conditions with direct visibility impairment are included;
# categories are chosen conservatively to avoid overgeneralization
poor_visibility_event = df_csv_data_cleaned["weather_condition"].isin([
"FOG/SMOKE/HAZE",
"SLEET/HAIL",
"BLOWING SNOW",
"BLOWING SAND, SOIL, DIRT"
])
num_poor_visibility_crashes = poor_visibility_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_poor_visibility = num_poor_visibility_crashes / total_crashes
print(
f"Empirical Probability Event 7 (Poor Visibility Weather Crash): "
f"{empirical_probability_poor_visibility:.3%}"
)
# Event 8: Clear-weather crash
# Relevance: baseline driving environment with minimal weather-related visual impairment.
# Threshold: cloudy/overcast is grouped with clear because it usually does not
# reduce visibility significantly; this provides a meaningful contrast to the
# poor-visibility events
clear_weather_event = df_csv_data_cleaned["weather_condition"].isin([
"CLEAR",
"CLOUDY/OVERCAST"
])
num_clear_weather_crashes = clear_weather_event.sum()
total_crashes = len(df_csv_data_cleaned)
empirical_probability_clear_weather = num_clear_weather_crashes / total_crashes
print(
f"Empirical Probability Event 8 (Clear Weather Crash): "
f"{empirical_probability_clear_weather:.3%}"
)
Empirical Probability Event 1 (Late Hour Crash): 22.856%
Empirical Probability Event 2 (Rush Hour Crash): 35.926%
Empirical Probability Event 3 (Severe Crash): 3.397%
Empirical Probability Event 4 (Non-Severe Crash): 96.603%
Empirical Probability Non-Severe Crash + Severe Crash: 100.000%
Empirical Probability Event 5 (Rainy Weather Crash): 10.532%
Empirical Probability Event 6 (Snowy Weather Crash): 3.552%
Empirical Probability Event 7 (Poor Visibility Weather Crash): 0.394%
Empirical Probability Event 8 (Clear Weather Crash): 82.141%
# Visualization
# Bar Plot
event_probabilities = pd.DataFrame({
"Event": [
"Late Hour (20–5)",
"Rush Hour (7–9, 16–18)",
"Severe Crash",
"Non-Severe Crash",
"Rainy Weather",
"Snowy Weather",
"Poor Visibility",
"Clear Weather"
],
"Probability": [
empirical_probability_late_hour,
empirical_probability_rush_hour,
empirical_probability_severe,
empirical_probability_non_severe,
empirical_probability_rainy_weather,
empirical_probability_snowy_weather,
empirical_probability_poor_visibility,
empirical_probability_clear_weather
]
})
plt.figure(figsize=(12, 6))
plt.bar(event_probabilities["Event"], event_probabilities["Probability"])
plt.ylabel("Empirical Probability")
plt.title("Empirical Probabilities of Defined Crash Events")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
Cross tabulation analysis¶
# Function to display Cross tabulation
def plot_crosstab_table(crosstab, title, xlabel, ylabel):
# Absolute values
counts = crosstab.values
# Row-wise proportions
proportions = crosstab.div(crosstab.sum(axis=1), axis=0).values
n_rows, n_cols = counts.shape
plt.figure()
plt.imshow(np.ones_like(counts), vmin=0, vmax=1) # neutral background
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
# Axis ticks
plt.xticks(range(n_cols), crosstab.columns.astype(str))
plt.yticks(range(n_rows), crosstab.index.astype(str))
# Grid lines
plt.grid(False)
plt.gca().set_xticks(np.arange(-.5, n_cols, 1), minor=True)
plt.gca().set_yticks(np.arange(-.5, n_rows, 1), minor=True)
plt.grid(which="minor")
# Cell annotations
for i in range(n_rows):
for j in range(n_cols):
plt.text(
j, i,
f"{counts[i, j]}\n({proportions[i, j]*100:.1f}%)",
ha="center",
va="center"
)
plt.tight_layout()
plt.show()
# Cross Tabulation 1: Late-Hour Crash vs Severe Crash
# Relevance: night driving is associated with fatigue, reduced visibility, and alcohol.
# What we analyze: Are severe crashes over-represented during late hours?
crosstab_late_severe = pd.crosstab(
late_hour_event,
severe_crash_event,
rownames=["Late Hour Crash"],
colnames=["Severe Crash"]
)
print("Cross Tabulation: Late-Hour Crash vs Severe Crash")
print(crosstab_late_severe)
plot_crosstab_table(
crosstab_late_severe,
title="Late-Hour Crash vs Severe Crash",
xlabel="Severe Crash",
ylabel="Late-Hour Crash"
)
# Cross Tabulation 2: Rush-Hour Crash vs Non-Severe Crash
# Relevance: rush hours involve high traffic density, lower speeds, and
# frequent stop-and-go situations
# Hypothesis: more crashes, but often less severe
# What we analyze: Is the proportion of non-severe crashes higher during rush hours?
crosstab_rush_non_severe = pd.crosstab(
rush_hour_event,
non_severe_crash_event,
rownames=["Rush Hour Crash"],
colnames=["Non-Severe Crash"]
)
print("\nCross Tabulation: Rush-Hour Crash vs Non-Severe Crash")
print(crosstab_rush_non_severe)
plot_crosstab_table(
crosstab_rush_non_severe,
title="Rush-Hour Crash vs Non-Severe Crash",
xlabel="Non-Severe Crash",
ylabel="Rush-Hour Crash"
)
# Cross Tabulation 3: Poor Visibility vs Severe Crash
# Relevance: visibility directly affects reaction time; this is a well-established
# relationship in traffic safety research
# What we analyze: Do severe crashes occur more frequently under poor visibility conditions?
crosstab_visibility_severe = pd.crosstab(
poor_visibility_event,
severe_crash_event,
rownames=["Poor Visibility"],
colnames=["Severe Crash"]
)
print("\nCross Tabulation: Poor Visibility vs Severe Crash")
print(crosstab_visibility_severe)
plot_crosstab_table(
crosstab_visibility_severe,
title="Poor Visibility vs Severe Crash",
xlabel="Severe Crash",
ylabel="Poor Visibility"
)
# Cross Tabulation 4: Clear Weather vs Non-Severe Crash
# Relevance: Clear conditions serve as a baseline
# What we analyze: Are crashes under clear weather more likely to be non-severe?
crosstab_clear_non_severe = pd.crosstab(
clear_weather_event,
non_severe_crash_event,
rownames=["Clear Weather"],
colnames=["Non-Severe Crash"]
)
print("\nCross Tabulation: Clear Weather vs Non-Severe Crash")
print(crosstab_clear_non_severe)
plot_crosstab_table(
crosstab_clear_non_severe,
title="Clear Weather vs Non-Severe Crash",
xlabel="Non-Severe Crash",
ylabel="Clear Weather"
)
Cross Tabulation: Late-Hour Crash vs Severe Crash
Severe Crash      False  True
Late Hour Crash
False            139608  4344
True              40655  1994

Cross Tabulation: Rush-Hour Crash vs Non-Severe Crash
Non-Severe Crash  False    True
Rush Hour Crash
False              4407  115155
True               1931   65108

Cross Tabulation: Poor Visibility vs Severe Crash
Severe Crash      False  True
Poor Visibility
False            179546  6319
True                717    19

Cross Tabulation: Clear Weather vs Non-Severe Crash
Non-Severe Crash  False    True
Clear Weather
False               931   32394
True               5407  147869
Conditional Probability¶
# Conditional Probability 1:
# Probability of a severe crash given a late-hour crash
# Relevance: Late hours are associated with fatigue, alcohol, and reduced visibility,
# which may increase crash severity.
# Marginal probabilities
P_severe = severe_crash_event.sum() / len(df_csv_data_cleaned)
P_late_hour = late_hour_event.sum() / len(df_csv_data_cleaned)
# Conditional probabilities
P_severe_given_late = (
severe_crash_event & late_hour_event
).sum() / late_hour_event.sum()
P_late_given_severe = (
severe_crash_event & late_hour_event
).sum() / severe_crash_event.sum()
print(f"P(Severe Crash): {P_severe:.3%}")
print(f"P(Late-Hour Crash): {P_late_hour:.3%}")
print(f"P(Severe Crash | Late-Hour Crash): {P_severe_given_late:.3%}")
print(f"P(Late-Hour Crash | Severe Crash): {P_late_given_severe:.3%}")
# Conditional Probability 2:
# Probability of a severe crash given poor visibility conditions
# Relevance: Poor visibility impairs hazard perception and reaction time,
# potentially leading to higher-impact collisions.
# Marginal probabilities
P_poor_visibility = poor_visibility_event.sum() / len(df_csv_data_cleaned)
# Conditional probabilities
P_severe_given_poor_visibility = (
severe_crash_event & poor_visibility_event
).sum() / poor_visibility_event.sum()
P_poor_visibility_given_severe = (
severe_crash_event & poor_visibility_event
).sum() / severe_crash_event.sum()
print(f"P(Severe Crash): {P_severe:.3%}")
print(f"P(Poor Visibility Crash): {P_poor_visibility:.3%}")
print(f"P(Severe Crash | Poor Visibility): {P_severe_given_poor_visibility:.3%}")
print(f"P(Poor Visibility Crash | Severe Crash): {P_poor_visibility_given_severe:.3%}")
# Conditional Probability 3:
# Probability of a non-severe crash given a rush-hour crash
# Relevance: Rush-hour traffic typically involves lower speeds due to congestion,
# which may reduce injury severity despite higher crash frequency.
# Marginal probabilities
P_non_severe = non_severe_crash_event.sum() / len(df_csv_data_cleaned)
P_rush_hour = rush_hour_event.sum() / len(df_csv_data_cleaned)
# Conditional probabilities
P_non_severe_given_rush = (
non_severe_crash_event & rush_hour_event
).sum() / rush_hour_event.sum()
P_rush_given_non_severe = (
non_severe_crash_event & rush_hour_event
).sum() / non_severe_crash_event.sum()
print(f"P(Non-Severe Crash): {P_non_severe:.3%}")
print(f"P(Rush-Hour Crash): {P_rush_hour:.3%}")
print(f"P(Non-Severe Crash | Rush-Hour Crash): {P_non_severe_given_rush:.3%}")
print(f"P(Rush-Hour Crash | Non-Severe Crash): {P_rush_given_non_severe:.3%}")
P(Severe Crash): 3.397%
P(Late-Hour Crash): 22.856%
P(Severe Crash | Late-Hour Crash): 4.675%
P(Late-Hour Crash | Severe Crash): 31.461%

P(Severe Crash): 3.397%
P(Poor Visibility Crash): 0.394%
P(Severe Crash | Poor Visibility): 2.582%
P(Poor Visibility Crash | Severe Crash): 0.300%

P(Non-Severe Crash): 96.603%
P(Rush-Hour Crash): 35.926%
P(Non-Severe Crash | Rush-Hour Crash): 97.120%
P(Rush-Hour Crash | Non-Severe Crash): 36.118%
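As a sanity check, the two conditional directions should reproduce the same joint probability (Bayes' rule); a minimal sketch using the quantities computed above:
# P(A|B) * P(B) and P(B|A) * P(A) must both equal P(A and B)
joint = (severe_crash_event & late_hour_event).mean()
print(f"P(Severe and Late):           {joint:.4%}")
print(f"P(Severe | Late) * P(Late):   {P_severe_given_late * P_late_hour:.4%}")
print(f"P(Late | Severe) * P(Severe): {P_late_given_severe * P_severe:.4%}")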
Summary of Observations¶
Observations:
- Contrary to intuitive expectations, the conditional probabilities of severe crashes under late-hour and poor-visibility conditions are relatively low and of similar magnitude: P(Severe Crash | Late Hour) ≈ 4.7%, P(Severe Crash | Poor Visibility) ≈ 2.6%, and P(Severe Crash | Rush Hour) ≈ 2.9% (the complement of the 97.1% non-severe share during rush hours).
- Conditions that increase the likelihood of crashes do not necessarily increase the severity of outcomes (quantified as a lift in the sketch below).
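A minimal sketch making the second point explicit via the lift P(Severe | Condition) / P(Severe), where values near 1 mean the condition barely changes severity risk (reuses the event variables defined above):
# Lift: how much does each condition scale the baseline severity probability?
p_severe = severe_crash_event.mean()
for name, event in {
    "Late Hour": late_hour_event,
    "Poor Visibility": poor_visibility_event,
    "Rush Hour": rush_hour_event,
}.items():
    lift = ((severe_crash_event & event).sum() / event.sum()) / p_severe
    print(f"Lift of severe crash given {name}: {lift:.2f}")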
Assumptions:
- All crashes are treated as independent observations, even though multiple crashes may involve similar locations, times, or drivers.
- Binary event definitions (e.g., severe vs non-severe) adequately capture the complexity of crash outcomes.
- Weather and lighting conditions recorded at the time of the crash correctly represent the actual driving conditions experienced by drivers.
- The chosen thresholds (e.g., late hours, rush hours, severity definitions) meaningfully separate different risk regimes.
Potential Bias and Limitations:
- Fatal and incapacitating injury crashes are relatively rare.
- This can lead to unstable probability estimates and large relative changes from small absolute differences.
- Injury severity may be misclassified or updated after initial reporting.
- Weather and visibility conditions are categorical and may not capture intensity (e.g., light vs heavy rain).
- Time-based thresholds (late hours, rush hours) are somewhat arbitrary and based on common conventions.
- Slight changes to these thresholds could alter the probability estimates (see the sensitivity sketch after this list).
- Variables such as speed, driver age, alcohol involvement, and road type are not controlled for.
- Traffic volume is not directly observed, limiting interpretation of frequency-based probabilities.
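A minimal sketch of a threshold-sensitivity check for the late-hour event (the alternative start hours are illustrative assumptions):
# Vary the late-hour start cutoff and observe how the estimate moves
for start_hour in [19, 20, 21, 22]:
    event = (
        (df_csv_data_cleaned["crash_hour"] >= start_hour) |
        (df_csv_data_cleaned["crash_hour"] <= 5)
    )
    print(f"Late-hour window starting {start_hour}:00: P = {event.mean():.3%}")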
D. Statistical Theory Applications (45 points)¶
Assigned to Tim
- Law of Large Numbers demonstration (15 points)
- Central Limit Theorem application (sampling distributions, effect of sample size, interpretation) (25 points)
- Result interpretation and sanity checks (what would invalidate your conclusion, what you verified) (5 points)
Law of Large Numbers¶
# Convert severe crash event to indicator variable
# 1 = severe crash, 0 = non-severe crash
severe_indicator = severe_crash_event.astype(int).values
# Shuffle observations to simulate random order
np.random.seed(42)
np.random.shuffle(severe_indicator)
# Sample sizes
sample_sizes = np.arange(1, len(severe_indicator) + 1)
# Cumulative empirical probability
cumulative_mean = np.cumsum(severe_indicator) / sample_sizes
# Overall empirical probability (reference)
overall_probability = cumulative_mean[-1]
# Plot convergence at increasing zoom levels, then over the full data set
zoom_levels = [100, 1000, 5000, 10000, len(severe_indicator)]
for max_samples in zoom_levels:
    plt.figure()
    plt.plot(
        sample_sizes[:max_samples],
        cumulative_mean[:max_samples],
        label="Cumulative Empirical Probability"
    )
    plt.axhline(
        overall_probability,
        linestyle="--",
        label="Overall Empirical Probability"
    )
    plt.xlabel("Sample Size")
    plt.ylabel("Empirical Probability of Severe Crash")
    if max_samples == len(severe_indicator):
        plt.title(
            "Law of Large Numbers Demonstration\n"
            "Convergence of Empirical Probability (Full Data Set)"
        )
    else:
        plt.title(
            "Law of Large Numbers Demonstration\n"
            f"Empirical Probability (First {max_samples})"
        )
    plt.legend()
    plt.show()
Observations¶
When considering only the first 1,000 observations, the empirical probability appears to converge toward a lower value. As the sample size increases, the estimate exhibits decreasing fluctuations and begins to stabilize. Around 10,000 observations, the empirical probability converges closely to the overall value, illustrating the stabilizing effect predicted by the Law of Large Numbers. Additionally, the apparent convergence observed at small sample sizes is unstable and sensitive to random variation, emphasizing that early estimates may be misleading without sufficiently large samples.
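This visual impression can be quantified: for a Bernoulli indicator with success probability p, the standard error of the mean of n observations is sqrt(p(1 − p) / n). A small check using the overall empirical probability as a plug-in estimate of p:
# Standard error of the cumulative mean at the zoom levels plotted above
p_hat = overall_probability
for n in (100, 1000, 5000, 10000):
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    print(f"n = {n:>6}: standard error ≈ {se:.4f} ({se / p_hat:.0%} of p)")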
Central Limit Theorem application¶
np.random.seed(42)
# Indicator variable: 1 = severe crash, 0 = non-severe crash
data = severe_crash_event.astype(int).values
# Number of bootstrap samples
num_samples = 1000
# Sample sizes
n_30 = 30
n_1000 = 1000
n_10000 = 10000
# Sampling distributions
sample_means_30 = [
np.mean(np.random.choice(data, size=n_30, replace=True))
for _ in range(num_samples)
]
sample_means_1000 = [
np.mean(np.random.choice(data, size=n_1000, replace=True))
for _ in range(num_samples)
]
sample_means_10000 = [
np.mean(np.random.choice(data, size=n_10000, replace=True))
for _ in range(num_samples)
]
# Plot the sampling distribution of the mean for each sample size
for n, means in [
    (n_30, sample_means_30),
    (n_1000, sample_means_1000),
    (n_10000, sample_means_10000),
]:
    plt.figure()
    plt.hist(means, bins=30)
    plt.xlabel("Sample Mean")
    plt.ylabel("Frequency")
    plt.title(f"Sampling Distribution of the Mean (n = {n})")
    plt.show()
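As an additional visual check (not part of the original plots), the normal density implied by the CLT can be overlaid on the largest-sample histogram; a sketch using scipy.stats.norm:
from scipy.stats import norm

# Overlay the fitted normal density on the n = 10000 sampling distribution
mu = np.mean(sample_means_10000)
sigma = np.std(sample_means_10000)
xs = np.linspace(min(sample_means_10000), max(sample_means_10000), 200)
plt.figure()
plt.hist(sample_means_10000, bins=30, density=True)
plt.plot(xs, norm.pdf(xs, mu, sigma), linestyle="--", label="Normal fit")
plt.xlabel("Sample Mean")
plt.ylabel("Density")
plt.title("Normal Overlay on Sampling Distribution (n = 10000)")
plt.legend()
plt.show()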
Observations¶
- Each histogram shows the distribution of sample means, not the distribution of individual crashes.
Effect of sample size:
- For n = 30: a wide, irregular, strongly right-skewed distribution, resembling an exponential decay more than a bell curve
- For n = 1000: narrower and more symmetric, but still with large spikes
- For n = 10000: even more symmetric; a normal shape is clearly distinguishable
Despite the binary nature of the original data, the sampling distribution of the mean approaches a normal shape. The numeric check below confirms that its spread also shrinks at the predicted rate.
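The numeric check compares the empirical standard deviation of each sampling distribution against the CLT prediction sqrt(p(1 − p) / n):
# Empirical vs theoretical spread of the sampling distributions
p_hat = data.mean()
for n, means in [
    (n_30, sample_means_30),
    (n_1000, sample_means_1000),
    (n_10000, sample_means_10000),
]:
    empirical_sd = np.std(means)
    theoretical_sd = np.sqrt(p_hat * (1 - p_hat) / n)
    print(f"n = {n:>5}: empirical SD = {empirical_sd:.4f}, theoretical SD = {theoretical_sd:.4f}")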
Observations and Sanity Check¶
Observations:
The Law of Large Numbers is supported by the observed convergence of the empirical probability of a severe crash as the sample size increases. While early estimates fluctuate substantially, the cumulative mean stabilizes as more observations are included, indicating convergence toward a stable long-run value.
The Central Limit Theorem is illustrated by the behavior of the sampling distributions of the sample mean. As sample size increases, the distributions become more concentrated and symmetric, consistent with the theoretical prediction that sample means approach a normal distribution even when the underlying data are not normally distributed.
Sanity Check:
- The observations were randomly shuffled prior to analysis to avoid artifacts caused by ordering effects in the dataset.
- Multiple sample sizes were examined to verify that convergence and distributional changes were not specific to a single choice of sample size.
- A large number of repeated samples was used to ensure that the observed sampling distributions were stable.
What Would Invalidate These Conclusions:
- Strong dependence between observations (e.g., repeated crashes involving the same drivers or locations) would violate the independence assumption underlying both the Law of Large Numbers and the Central Limit Theorem.
- Severe underreporting or systematic misclassification of crash severity would distort the empirical distributions and bias convergence behavior.
- Insufficient sample sizes would undermine the reliability of the observed convergence and sampling distributions.
E. Regression and Predictive Modeling (45 points)¶
Assigned to Lorenz
- Define a prediction target and features (justify why they make sense) (10 points)
- Linear or polynomial model selection (include rationale and show at least two candidates) (10 points)
- Model fitting and validation (train-test split appropriate for time series, e.g., a time-based split) (15 points)
- Residual analysis and interpretation (errors, bias, failure cases, what to improve next) (10 points)
import xgboost as xgb
from sklearn.metrics import mean_squared_error
Create a new feature for prediction: injuries_per_day
# Create a new dataframe with daily total injuries
# (crash_date is the index of df_csv_data_cleaned after preprocessing, so reset it first)
df_tmp = df_csv_data_cleaned.reset_index()
df_daily_data = df_tmp.groupby(df_tmp['crash_date'].dt.date)['injuries_total'].sum().reset_index()
# apply a rolling mean with a window of 7 days
df_daily_data.columns = ['crash_date', 'injuries_total']
df_daily_data['injuries_total'] = df_daily_data['injuries_total'].rolling(window=7, center=True).mean()
# Remove all entries where there is a NaN value
df_daily_data = df_daily_data.dropna()
df_daily_data["crash_date"] = pd.to_datetime(df_daily_data["crash_date"])
df_daily_data["dayofweek"] = df_daily_data["crash_date"].dt.day_name()
df_daily_data["day_of_year"] = df_daily_data["crash_date"].dt.dayofyear
df_daily_data["quarter"] = df_daily_data["crash_date"].dt.quarter
plt.figure(figsize=(14, 6))
plt.plot(df_daily_data['crash_date'], df_daily_data['injuries_total'], linewidth=1.5)
plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Daily Total Injuries Over Time')
plt.tight_layout()
plt.show()
Train-Test Split
df_daily_data["crash_date"] = pd.to_datetime(df_daily_data["crash_date"])
train = df_daily_data[df_daily_data["crash_date"] < "2024-01-01"]
test = df_daily_data[df_daily_data["crash_date"] >= "2024-01-01"]
# Visualization: Train and test data on a timeline
plt.figure(figsize=(14, 6))
# Plot train data
plt.plot(train["crash_date"], train["injuries_total"], label="Train Data", linewidth=1.5)
# Plot test data
plt.plot(test["crash_date"], test["injuries_total"], label="Test Data", linewidth=1.5)
# Add vertical line at split point
split_date = pd.to_datetime("2024-01-01")
plt.axvline(x=split_date, color="black", linestyle="--", linewidth=2, label="Train-Test Split")
plt.xlabel("Date")
plt.ylabel("Daily Total Injuries")
plt.title("Train-Test Split: Daily Total Injuries Over Time")
plt.legend()
plt.tight_layout()
plt.show()
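A single cutoff date is the simplest valid split for time series. A more thorough validation would use rolling-origin cross-validation; a sketch with sklearn's TimeSeriesSplit, where each fold trains on the past and validates on the block that follows (shown here only to illustrate the fold boundaries, not run as part of the main pipeline):
from sklearn.model_selection import TimeSeriesSplit

# Each fold's training window ends before its validation window begins
tscv = TimeSeriesSplit(n_splits=5)
dates = df_daily_data["crash_date"].dt.date.values
for fold, (train_idx, val_idx) in enumerate(tscv.split(df_daily_data), start=1):
    print(f"Fold {fold}: train through {dates[train_idx[-1]]}, validate through {dates[val_idx[-1]]}")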
Create Model with Improved Features
from sklearn.preprocessing import LabelEncoder
# Use calendar features with day_of_year for smooth seasonal trends (no month steps)
df_daily_data_clean = df_daily_data.copy()
# Recreate train/test split
train = df_daily_data_clean[df_daily_data_clean["crash_date"] < "2024-01-01"]
test = df_daily_data_clean[df_daily_data_clean["crash_date"] >= "2024-01-01"]
FEATURES = ["dayofweek", "day_of_year", "quarter"]
TARGET = "injuries_total"
# Encode dayofweek to numeric (LabelEncoder assigns alphabetical codes, so the
# numeric order is not chronological; this is acceptable for tree-based models)
le = LabelEncoder()
X_train = train[FEATURES].copy()
X_train["dayofweek"] = le.fit_transform(X_train["dayofweek"])
X_test = test[FEATURES].copy()
X_test["dayofweek"] = le.transform(X_test["dayofweek"])
y_train = train[TARGET]
y_test = test[TARGET]
reg = xgb.XGBRegressor(
n_estimators = 100,
early_stopping_rounds = 15,
learning_rate = 0.1,
max_depth = 3,
subsample = 0.8,
colsample_bytree = 0.8,
reg_lambda = 2.0,
reg_alpha = 0.5
)
reg.fit(
X_train,
y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
verbose=True
)
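Before looking at the predictions, it can help to check which calendar features the fitted model actually relies on; a quick sketch using XGBoost's sklearn-style feature_importances_:
# Per-feature importance scores from the trained regressor
importances = pd.Series(reg.feature_importances_, index=FEATURES).sort_values()
plt.figure()
importances.plot(kind="barh")
plt.xlabel("Importance")
plt.title("XGBoost Feature Importances")
plt.tight_layout()
plt.show()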
# Only create predictions for test data
test_with_pred = test.copy()
test_with_pred["prediction"] = reg.predict(X_test)
# Plot full timeline
plt.figure(figsize=(16, 6))
plt.plot(train["crash_date"], train["injuries_total"], label='Training Data', linewidth=1.5, alpha=0.7)
plt.plot(test["crash_date"], test["injuries_total"], label='Test Data (Actual)', linewidth=1.5, alpha=0.7)
plt.plot(test_with_pred["crash_date"], test_with_pred["prediction"], label='Predictions', linewidth=1.5, linestyle='--', alpha=0.7)
# Add vertical line at train-test split
split_date = pd.to_datetime("2024-01-01")
plt.axvline(x=split_date, color="red", linestyle="--", linewidth=2, label="Train-Test Split")
plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Full Timeline: Original Data vs Predictions')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
test["prediction"] = reg.predict(X_test)
plt.figure(figsize=(14, 6))
plt.plot(test["crash_date"], test["injuries_total"], label='Original Data', linewidth=1.5)
plt.plot(test["crash_date"], test["prediction"], label='Predictions', linewidth=1.5, linestyle='--')
plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Original Data vs Predictions (2024)')
plt.legend()
plt.tight_layout()
plt.show()
# --- Evaluation metrics (test set) ---
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Prepare arrays
y_true = test["injuries_total"].values
y_pred = test["prediction"].values
plt.figure(figsize=(14, 6))
plt.scatter(test["crash_date"], test["injuries_total"], label='Original Data', alpha=0.6)
plt.scatter(test["crash_date"], test["prediction"], label='Predictions', alpha=0.6)
plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Original Data vs Predictions (2024)')
plt.legend()
plt.tight_layout()
plt.show()
# Metrics
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
# Handle zeros for MAPE (exclude days with zero actual injuries)
nonzero_mask = y_true != 0
if nonzero_mask.any():
    mape = np.abs((y_true[nonzero_mask] - y_pred[nonzero_mask]) / y_true[nonzero_mask]).mean() * 100
else:
    mape = np.nan
print(f"Test RMSE: {rmse:.3f}")
print(f"Test MAE: {mae:.3f}")
print(f"Test R^2: {r2:.3f}")
print(f"Test MAPE: {mape:.2f}%\n")
# Residuals distribution
residuals = y_true - y_pred
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Residuals histogram
sns.histplot(residuals, kde=True, bins=30, ax=axes[0])
axes[0].set_title('Residuals Distribution (Test set)')
axes[0].set_xlabel('Residual (Actual - Predicted)')
# Right plot: Predicted vs Actual scatter
axes[1].scatter(y_true, y_pred, alpha=0.6)
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
axes[1].plot(lims, lims, 'k--', linewidth=1)
axes[1].set_xlabel('Actual')
axes[1].set_ylabel('Predicted')
axes[1].set_title('Predicted vs Actual (Test set)')
axes[1].set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
The model roughly tracks the actual values, but visible plateaus appear in the predictions: with only three coarse calendar features, the model can output a limited number of distinct values.
The predictions are also systematically lower than the actual values, because more injuries occurred in 2024 than in the previous years the model was trained on, and a purely calendar-based model cannot anticipate such a level shift.
Possible improvements: supply more features that describe the fluctuations (a sketch follows below), and preprocess the input data more carefully, for example by removing outliers.
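As a sketch of the first improvement (illustrative feature names, assuming df_daily_data as built above), cyclical encodings and lagged targets could be added before refitting:
# Cyclical encoding: maps day_of_year onto a circle so day 365 and day 1 are neighbors
df_feat = df_daily_data.copy()
df_feat["doy_sin"] = np.sin(2 * np.pi * df_feat["day_of_year"] / 365.25)
df_feat["doy_cos"] = np.cos(2 * np.pi * df_feat["day_of_year"] / 365.25)
# Lagged targets capture short-term autocorrelation; shift() uses only past
# values, so no information leaks from the future into the features
for lag in (1, 7):
    df_feat[f"injuries_lag_{lag}"] = df_feat["injuries_total"].shift(lag)
df_feat = df_feat.dropna()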
F. Dimensionality Reduction and Statistical Tests (40 points)¶
Assigned to Jeremia
Dimensionality Reduction (25 points)¶
- PCA projection and interpretation (variance explained, what clusters or separations mean) (10 points)
- t-SNE embedding with justified hyperparameters (perplexity or similar) and interpretation (7 points)
- UMAP embedding with justified hyperparameters (neighbors, min dist or similar) and interpretation (8 points)
PCA Projection¶
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Select numeric columns
num_cols = [
"injuries_total", "injuries_fatal", "injuries_incapacitating",
"injuries_non_incapacitating", "injuries_reported_not_evident",
"injuries_no_indication", "num_units", "crash_hour",
"crash_day_of_week", "crash_month"
]
X = df_csv_data_cleaned.reset_index()[num_cols].dropna()
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit PCA (all components first to see variance)
pca = PCA()
pca.fit(X_scaled)
# Variance explained
plt.figure()
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")
plt.title("PCA: Variance Explained by Component")
plt.show()
print("Cumulative variance explained:", np.cumsum(pca.explained_variance_ratio_))
pca2 = PCA(n_components=2)
X_pca = pca2.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3, s=5)
plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]*100:.1f}%)")
plt.title("PCA projection (2D)")
plt.show()
# Loadings
loadings = pd.DataFrame(
pca2.components_.T,
columns=["PC1", "PC2"],
index=num_cols
)
print(loadings)
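To read the loadings table more easily, the features can be ranked by absolute loading per component, which highlights what each axis of the projection mainly encodes:
# Top contributors to each principal component by absolute loading
for pc in ["PC1", "PC2"]:
    top = loadings[pc].abs().sort_values(ascending=False).head(3)
    print(f"Top |loading| features for {pc}:")
    print(top, "\n")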
# Color by crash_type (align indices)
labels = df_csv_data_cleaned.reset_index().loc[X.index, "crash_type"]
plt.figure(figsize=(8, 6))
for label in labels.unique():
mask = labels == label
plt.scatter(X_pca[mask, 0], X_pca[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]*100:.1f}%)")
plt.title("PCA projection colored by crash type")
plt.legend()
plt.show()
# Color PCA projection by most_severe_injury (aligned to X index)
labels_severity = df_csv_data_cleaned.reset_index().loc[X.index, "most_severe_injury"]
plt.figure(figsize=(8, 6))
for label in labels_severity.unique():
    mask_sev = labels_severity == label
    plt.scatter(X_pca[mask_sev, 0], X_pca[mask_sev, 1], alpha=0.3, s=5, label=label)
plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]*100:.1f}%)")
plt.title("PCA projection colored by most severe injury")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=6)
plt.tight_layout()
plt.show()
t-SNE embedding¶
from sklearn.manifold import TSNE
# t-SNE Hyperparameter Justification:
# -----------------------------------
# perplexity=30: Balances local vs global structure. Typical range is 5-50.
# - Lower perplexity focuses on local neighborhoods
# - Higher perplexity captures more global structure
# - 30 is a common default that works well for medium-to-large datasets
#
# max_iter=1000: Sufficient iterations for convergence (default)
#
# learning_rate='auto': Automatically set to N/early_exaggeration/4 (recommended)
#
# random_state=42: For reproducibility
#
# Sampling: t-SNE scales poorly with N (quadratic for the exact method;
# Barnes-Hut is O(N log N) but still slow), so we sample 10,000 points
# for computational tractability while preserving overall structure
# Sample data for t-SNE (full dataset is too large)
sample_size = 10000
np.random.seed(42)
sample_idx = np.random.choice(len(X_scaled), size=sample_size, replace=False)
X_sample = X_scaled[sample_idx]
# Fit t-SNE with justified hyperparameters
tsne = TSNE(
n_components=2,
perplexity=30, # Balance between local/global structure
max_iter=1000, # Standard iterations for convergence
learning_rate='auto',
random_state=42
)
X_tsne = tsne.fit_transform(X_sample)
print(f"t-SNE completed. Final KL divergence: {tsne.kl_divergence_:.4f}")
# Plot t-SNE embedding
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.3, s=5)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title(f"t-SNE projection (perplexity=30, n={sample_size})")
plt.show()
# Color t-SNE by crash_day_of_week
labels_sample = df_csv_data_cleaned.reset_index().loc[X.index[sample_idx], "crash_day_of_week"].values
plt.figure(figsize=(8, 6))
for label in np.unique(labels_sample):
mask = labels_sample == label
plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("t-SNE projection colored by crash_day_of_week")
plt.legend()
plt.show()
# Color t-SNE by most_severe_injury
labels_injury = df_csv_data_cleaned.reset_index().loc[X.index[sample_idx], "most_severe_injury"].values
plt.figure(figsize=(8, 6))
for label in np.unique(labels_injury):
mask = labels_injury == label
plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("t-SNE projection colored by most severe injury")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=6)
plt.tight_layout()
plt.show()
# Compare different perplexity values to show effect on embedding
fig, axes = plt.subplots(1, 5, figsize=(18, 4))
perplexities = [10, 20, 30, 50, 80]
for ax, perp in zip(axes, perplexities):
tsne_test = TSNE(
n_components=2,
perplexity=perp,
max_iter=1000,
learning_rate='auto',
random_state=42
)
X_tsne_test = tsne_test.fit_transform(X_sample)
ax.scatter(X_tsne_test[:, 0], X_tsne_test[:, 1], alpha=0.3, s=3)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title(f"Perplexity = {perp}\nKL div: {tsne_test.kl_divergence_:.3f}")
plt.suptitle("Effect of perplexity on t-SNE embedding", y=1.02)
plt.tight_layout()
plt.show()
# Interpretation:
# - Perplexity 10: Emphasizes very local structure, may fragment clusters
# - Perplexity 20-30: Balanced view, captures both local and some global structure
# - Perplexity 50-80: More global view, clusters may merge but the overall shape is preserved
UMAP embedding¶
import umap
# UMAP Hyperparameter Justification:
# -----------------------------------
# n_neighbors=15: Controls local vs global structure balance
# - Lower values (5-10): Focus on very local structure, may fragment data
# - Higher values (50+): Emphasize global structure, lose local detail
# - 15 is a common default balancing local manifold structure with global topology
#
# min_dist=0.1: Controls how tightly points cluster
# - Lower values (0.0-0.1): Allows tighter clustering, better for finding clusters
# - Higher values (0.5-1.0): Spreads points more evenly, preserves topological structure
# - 0.1 allows clusters to form while maintaining some separation
#
# metric='euclidean': Standard distance metric for standardized numerical data
#
# random_state=42: For reproducibility
#
# n_epochs=500: Sufficient for convergence on medium datasets
#
# Using same sample as t-SNE for fair comparison
# Fit UMAP with justified hyperparameters
umap_model = umap.UMAP(
n_neighbors=15, # Balance local/global structure
min_dist=0.1, # Allow tight clustering
n_components=2,
metric='euclidean',
random_state=42,
n_epochs=500
)
X_umap = umap_model.fit_transform(X_sample)
# Plot UMAP embedding
plt.figure(figsize=(8, 6))
plt.scatter(X_umap[:, 0], X_umap[:, 1], alpha=0.3, s=5)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title(f"UMAP projection (n_neighbors=15, min_dist=0.1, n={sample_size})")
plt.show()
# Color UMAP by crash_day_of_week
plt.figure(figsize=(8, 6))
for label in np.unique(labels_sample):
mask = labels_sample == label
plt.scatter(X_umap[mask, 0], X_umap[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("UMAP projection colored by crash_day_of_week")
plt.legend()
plt.show()
# Color UMAP by most_severe_injury
plt.figure(figsize=(8, 6))
for label in np.unique(labels_injury):
mask = labels_injury == label
plt.scatter(X_umap[mask, 0], X_umap[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("UMAP projection colored by most severe injury")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=6)
plt.tight_layout()
plt.show()
# Compare effect of n_neighbors on UMAP embedding
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
n_neighbors_values = [5, 15, 30, 50]
for ax, n_neigh in zip(axes, n_neighbors_values):
umap_test = umap.UMAP(
n_neighbors=n_neigh,
min_dist=0.1,
n_components=2,
random_state=42,
n_epochs=500
)
X_umap_test = umap_test.fit_transform(X_sample)
ax.scatter(X_umap_test[:, 0], X_umap_test[:, 1], alpha=0.3, s=3)
ax.set_xlabel("UMAP 1")
ax.set_ylabel("UMAP 2")
ax.set_title(f"n_neighbors = {n_neigh}")
plt.suptitle("Effect of n_neighbors on UMAP embedding (min_dist=0.1)", y=1.02)
plt.tight_layout()
plt.show()
# Interpretation:
# - n_neighbors=5: Very local focus, may over-fragment the data
# - n_neighbors=15: Good balance of local detail and global connectivity
# - n_neighbors=30: More global structure, clusters begin to merge
# - n_neighbors=50: Emphasizes global topology, local details smoothed out
# Compare effect of min_dist on UMAP embedding
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
min_dist_values = [0.0, 0.1, 0.25, 0.5]
for ax, m_dist in zip(axes, min_dist_values):
umap_test = umap.UMAP(
n_neighbors=15,
min_dist=m_dist,
n_components=2,
random_state=42,
n_epochs=500
)
X_umap_test = umap_test.fit_transform(X_sample)
ax.scatter(X_umap_test[:, 0], X_umap_test[:, 1], alpha=0.3, s=3)
ax.set_xlabel("UMAP 1")
ax.set_ylabel("UMAP 2")
ax.set_title(f"min_dist = {m_dist}")
plt.suptitle("Effect of min_dist on UMAP embedding (n_neighbors=15)", y=1.02)
plt.tight_layout()
plt.show()
# Interpretation:
# - min_dist=0.0: Tightest clustering, points packed densely in cluster cores
# - min_dist=0.1: Slightly relaxed, clusters visible with some internal structure
# - min_dist=0.25: More spread out, easier to see point density variations
# - min_dist=0.5: Uniform spread, preserves more of the topological structure
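Beyond visual inspection, the embeddings can be compared quantitatively with sklearn's trustworthiness score, which measures how well local neighborhoods of the original space are preserved (1.0 = perfectly); a sketch reusing X_sample, X_tsne, and X_umap from above:
from sklearn.manifold import trustworthiness

# Scores closer to 1.0 indicate better preservation of local neighborhoods
for name, embedding in [("t-SNE", X_tsne), ("UMAP", X_umap)]:
    score = trustworthiness(X_sample, embedding, n_neighbors=15)
    print(f"{name} trustworthiness (k=15): {score:.3f}")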